This notebook describes the process we use to compare our synthetic microdata mimicking The Bureau of Labor Statistics' (BLS) QCEW dataset to the original BLS dataset.
The original data can be found here: https://www.bls.gov/cew/downloadable-data-files.htm by clicking on the quarterly, by area csv for the year 2016.
Upon downloading the file you may notice the difference in the directory structure of the BLS' data.
For example, for New Jersey (FIPS: 34), you will see that each county is split into its own csv,
'2016.q1-q4 34001 Atlantic County, New Jersey.csv'
'2016.q1-q4 34003 Bergen County, New Jersey.csv'
'2016.q1-q4 34005 Burlington County, New Jersey.csv'
...
whereas the generated synthetic data contains all counties in a single state file with cnty as a column.
First I want to make sure that all the counties are included in the generated microdata. This shouldn't be an issue but its safe to check.
Lets Check for New Jersey:
import pandas as pd
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
import numpy as np
import os
try:
df = pd.read_csv("../OrigQCEWStudy/nj34_qdb_2016_1.csv")
distinct_cnty = df['cnty'].unique()
print("Distinct cnty values:")
print(sorted(distinct_cnty))
except FileNotFoundError:
print("Error: File not found.")
Distinct cnty values: [np.int64(1), np.int64(3), np.int64(5), np.int64(7), np.int64(9), np.int64(11), np.int64(13), np.int64(15), np.int64(17), np.int64(19), np.int64(21), np.int64(23), np.int64(25), np.int64(27), np.int64(29), np.int64(31), np.int64(33), np.int64(35), np.int64(37), np.int64(39), np.int64(41)]
Comparing this list to the BLS' files for New Jersey shows that we have all the counties.
Now, to compare some basic summary statistics of both datasets, lets reformat the BLS data to only include the features that are generated in our microdata
First We will combine all of the county files into a single state file:
For now we are only looking at New Jersey
def load_combined_data(directory):
dfs = []
for filename in os.listdir(directory):
if filename.endswith('.csv'):
filepath = os.path.join(directory, filename)
# Read CSV file
df = pd.read_csv(filepath)
dfs.append(df)
# Combine all DataFrames
combined_df = pd.concat(dfs, ignore_index=True)
return combined_df
real_nj_data = load_combined_data('./TrueQCEW/NewJersey/')
Our generated microdata only looks at the first quarter.
Lets filter out any rows where qtr is not 1
real_nj_data = real_nj_data[real_nj_data['qtr'] == 1].copy()
In our generated microdata we have separate columns state and cnty
state is simply the first two digits of the area_fips code in the QCEW datasetcnty is the next three digitsLets create the state and cnty columns and remove any irrelevant columns
real_nj_data['state'] = real_nj_data['area_fips'].astype(str).str[:2] # First 2 digits
real_nj_data['cnty'] = real_nj_data['area_fips'].astype(str).str[2:] # Last 3 digits
# Remove the area_fips column
real_nj_data = real_nj_data.drop(columns=['area_fips'])
# List of columns to keep
columns_to_keep = [
'year',
'qtr',
'state',
'cnty',
'own_code',
'industry_code',
'month1_emplvl',
'month2_emplvl',
'month3_emplvl',
'total_qtrly_wages',
'avg_wkly_wage',
'agglvl_code',
'disclosure_code'
]
real_nj_data = real_nj_data[columns_to_keep]
To make computations easier on the brain we will also rename some of these columns to match our synthetic data:
month1_emplvl -> m1emp
month2_emplvl -> m2emp
month3_emplvl -> m3emp
real_nj_data = real_nj_data.rename(columns={'month1_emplvl':'m1emp',
'month2_emplvl':'m2emp',
'month3_emplvl':'m3emp'
})
Lets also load in the synthetic dataset for New Jersey
synth_nj_df = pd.read_csv('./nj34_qdb_2016_1.csv')
You'll notice that in the actual QCEW dataset we are given industry codes of various digit lengths. These correspond with the North American Industry Classification System (NAICS).
In the synthetic dataset, we give naics, naics3, naics4, naics5 and naics_sector.
the 6 digit naics code is the most specific industry classification for the establishment entry whereas naics_sector will be the most general industry classification.
For now, lets look at both datasets, but only for a specific sector code. Lets say 22. This means that we will only be interested in entries of the QCEW data where the first two digits of industry_code is 22 and entries of the synthetic dataset where naics_sector=22
Since our synthetic data only includes establishments with private ownership, we also filter out entries where own_code is not 5
https://www.bls.gov/cew/classifications/ownerships/ownership-titles.htm
synth_nj_df_22 = synth_nj_df[synth_nj_df['naics_sector'] == 22]
synth_nj_df_22.head()
# Group by county and sum the employment columns
# Group by the specified columns and sum the employment columns
synth_nj_df_22_sums = synth_nj_df_22.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics_sector']
).agg({
'm1emp': 'sum',
'm2emp': 'sum',
'm3emp': 'sum'
}).reset_index()
synth_nj_df_22_sums
| year | qtr | state | cnty | own | naics_sector | m1emp | m2emp | m3emp | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 34 | 1 | 5 | 22 | 755 | 753 | 750 |
| 1 | 2016 | 1 | 34 | 3 | 5 | 22 | 1089 | 1088 | 1085 |
| 2 | 2016 | 1 | 34 | 5 | 5 | 22 | 1884 | 1258 | 639 |
| 3 | 2016 | 1 | 34 | 7 | 5 | 22 | 320 | 324 | 323 |
| 4 | 2016 | 1 | 34 | 9 | 5 | 22 | 207 | 172 | 136 |
| 5 | 2016 | 1 | 34 | 11 | 5 | 22 | 236 | 171 | 97 |
| 6 | 2016 | 1 | 34 | 13 | 5 | 22 | 703 | 702 | 708 |
| 7 | 2016 | 1 | 34 | 15 | 5 | 22 | 272 | 272 | 274 |
| 8 | 2016 | 1 | 34 | 17 | 5 | 22 | 995 | 993 | 991 |
| 9 | 2016 | 1 | 34 | 19 | 5 | 22 | 192 | 148 | 99 |
| 10 | 2016 | 1 | 34 | 21 | 5 | 22 | 643 | 584 | 527 |
| 11 | 2016 | 1 | 34 | 23 | 5 | 22 | 768 | 771 | 772 |
| 12 | 2016 | 1 | 34 | 25 | 5 | 22 | 1292 | 1293 | 1299 |
| 13 | 2016 | 1 | 34 | 27 | 5 | 22 | 568 | 564 | 561 |
| 14 | 2016 | 1 | 34 | 29 | 5 | 22 | 2826 | 1968 | 1103 |
| 15 | 2016 | 1 | 34 | 31 | 5 | 22 | 1303 | 1074 | 848 |
| 16 | 2016 | 1 | 34 | 33 | 5 | 22 | 2163 | 2147 | 2125 |
| 17 | 2016 | 1 | 34 | 35 | 5 | 22 | 4113 | 2371 | 631 |
| 18 | 2016 | 1 | 34 | 37 | 5 | 22 | 186 | 186 | 185 |
| 19 | 2016 | 1 | 34 | 39 | 5 | 22 | 1565 | 1566 | 1561 |
| 20 | 2016 | 1 | 34 | 41 | 5 | 22 | 223 | 209 | 199 |
real_nj_22 = real_nj_data[
(real_nj_data['industry_code'].str[:2]=='22') &
(real_nj_data['industry_code'].astype(str).str.len() == 2) &
(real_nj_data['own_code'] == 5)
][['year', 'qtr', 'state', 'cnty', 'industry_code','own_code', 'm1emp', 'm2emp', 'm3emp']].copy()
real_nj_22
| year | qtr | state | cnty | industry_code | own_code | m1emp | m2emp | m3emp | |
|---|---|---|---|---|---|---|---|---|---|
| 1088 | 2016 | 1 | 34 | 011 | 22 | 5 | 0 | 0 | 0 |
| 6932 | 2016 | 1 | 34 | 029 | 22 | 5 | 1081 | 1079 | 1080 |
| 14088 | 2016 | 1 | 34 | 027 | 22 | 5 | 518 | 515 | 518 |
| 21256 | 2016 | 1 | 34 | 037 | 22 | 5 | 179 | 179 | 173 |
| 27396 | 2016 | 1 | 34 | 023 | 22 | 5 | 871 | 866 | 865 |
| 35068 | 2016 | 1 | 34 | 033 | 22 | 5 | 0 | 0 | 0 |
| 39320 | 2016 | 1 | 34 | 031 | 22 | 5 | 0 | 0 | 0 |
| 46440 | 2016 | 1 | 34 | 041 | 22 | 5 | 0 | 0 | 0 |
| 52276 | 2016 | 1 | 34 | 025 | 22 | 5 | 1435 | 1436 | 1425 |
| 60252 | 2016 | 1 | 34 | 013 | 22 | 5 | 614 | 614 | 616 |
| 67596 | 2016 | 1 | 34 | 009 | 22 | 5 | 155 | 154 | 152 |
| 72956 | 2016 | 1 | 34 | 035 | 22 | 5 | 285 | 287 | 290 |
| 79564 | 2016 | 1 | 34 | 039 | 22 | 5 | 1314 | 1303 | 1300 |
| 87580 | 2016 | 1 | 34 | 005 | 22 | 5 | 574 | 567 | 566 |
| 95032 | 2016 | 1 | 34 | 007 | 22 | 5 | 415 | 405 | 376 |
| 102564 | 2016 | 1 | 34 | 001 | 22 | 5 | 711 | 709 | 730 |
| 108700 | 2016 | 1 | 34 | 017 | 22 | 5 | 0 | 0 | 0 |
| 115980 | 2016 | 1 | 34 | 003 | 22 | 5 | 1163 | 1163 | 1163 |
| 123760 | 2016 | 1 | 34 | 015 | 22 | 5 | 256 | 256 | 259 |
| 130820 | 2016 | 1 | 34 | 021 | 22 | 5 | 545 | 538 | 537 |
| 137532 | 2016 | 1 | 34 | 019 | 22 | 5 | 77 | 77 | 78 |
Ok this is great! We can now see the employment values for naics sector 22 for each county. Lets throw this into a single dataframe and see these side by side. Instead of viewing two separate dataframes, Lets look at the differences in m1,m2,and m3emp for each county over the 2 digit naics.
# Convert cnty in real_nj_22 to integer to match synth_nj_df_22_sums format
real_nj_22['cnty'] = real_nj_22['cnty'].astype(int)
real_nj_22['state'] = real_nj_22['state'].astype(int)
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_22_sums,
real_nj_22,
on=['year', 'qtr', 'state', 'cnty'],
suffixes=('_synth', '_real')
)
# Calculate differences
merged_df['m1emp_diff'] = merged_df['m1emp_synth'] - merged_df['m1emp_real']
merged_df['m2emp_diff'] = merged_df['m2emp_synth'] - merged_df['m2emp_real']
merged_df['m3emp_diff'] = merged_df['m3emp_synth'] - merged_df['m3emp_real']
# Select relevant columns for the result
result_df_22 = merged_df[['year', 'qtr', 'state', 'cnty',
'm1emp_synth', 'm1emp_real', 'm1emp_diff',
'm2emp_synth', 'm2emp_real', 'm2emp_diff',
'm3emp_synth', 'm3emp_real', 'm3emp_diff']]
# If you want to see counties with the largest absolute differences first
result_df_22 = result_df_22.sort_values(by='cnty', key=abs, ascending=True)
result_df_22
| year | qtr | state | cnty | m1emp_synth | m1emp_real | m1emp_diff | m2emp_synth | m2emp_real | m2emp_diff | m3emp_synth | m3emp_real | m3emp_diff | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 34 | 1 | 755 | 711 | 44 | 753 | 709 | 44 | 750 | 730 | 20 |
| 1 | 2016 | 1 | 34 | 3 | 1089 | 1163 | -74 | 1088 | 1163 | -75 | 1085 | 1163 | -78 |
| 2 | 2016 | 1 | 34 | 5 | 1884 | 574 | 1310 | 1258 | 567 | 691 | 639 | 566 | 73 |
| 3 | 2016 | 1 | 34 | 7 | 320 | 415 | -95 | 324 | 405 | -81 | 323 | 376 | -53 |
| 4 | 2016 | 1 | 34 | 9 | 207 | 155 | 52 | 172 | 154 | 18 | 136 | 152 | -16 |
| 5 | 2016 | 1 | 34 | 11 | 236 | 0 | 236 | 171 | 0 | 171 | 97 | 0 | 97 |
| 6 | 2016 | 1 | 34 | 13 | 703 | 614 | 89 | 702 | 614 | 88 | 708 | 616 | 92 |
| 7 | 2016 | 1 | 34 | 15 | 272 | 256 | 16 | 272 | 256 | 16 | 274 | 259 | 15 |
| 8 | 2016 | 1 | 34 | 17 | 995 | 0 | 995 | 993 | 0 | 993 | 991 | 0 | 991 |
| 9 | 2016 | 1 | 34 | 19 | 192 | 77 | 115 | 148 | 77 | 71 | 99 | 78 | 21 |
| 10 | 2016 | 1 | 34 | 21 | 643 | 545 | 98 | 584 | 538 | 46 | 527 | 537 | -10 |
| 11 | 2016 | 1 | 34 | 23 | 768 | 871 | -103 | 771 | 866 | -95 | 772 | 865 | -93 |
| 12 | 2016 | 1 | 34 | 25 | 1292 | 1435 | -143 | 1293 | 1436 | -143 | 1299 | 1425 | -126 |
| 13 | 2016 | 1 | 34 | 27 | 568 | 518 | 50 | 564 | 515 | 49 | 561 | 518 | 43 |
| 14 | 2016 | 1 | 34 | 29 | 2826 | 1081 | 1745 | 1968 | 1079 | 889 | 1103 | 1080 | 23 |
| 15 | 2016 | 1 | 34 | 31 | 1303 | 0 | 1303 | 1074 | 0 | 1074 | 848 | 0 | 848 |
| 16 | 2016 | 1 | 34 | 33 | 2163 | 0 | 2163 | 2147 | 0 | 2147 | 2125 | 0 | 2125 |
| 17 | 2016 | 1 | 34 | 35 | 4113 | 285 | 3828 | 2371 | 287 | 2084 | 631 | 290 | 341 |
| 18 | 2016 | 1 | 34 | 37 | 186 | 179 | 7 | 186 | 179 | 7 | 185 | 173 | 12 |
| 19 | 2016 | 1 | 34 | 39 | 1565 | 1314 | 251 | 1566 | 1303 | 263 | 1561 | 1300 | 261 |
| 20 | 2016 | 1 | 34 | 41 | 223 | 0 | 223 | 209 | 0 | 209 | 199 | 0 | 199 |
For rows where the real values are 0 or supressed (still 0 in the dataset) it's hard to tell what's going on...
Lets remove those rows and compute the mean and median difference in m1,m2,and m3emp.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Summary stats for differences
diff_stats = result_df_22[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']].describe()
print("Difference Statistics:")
print(diff_stats)
Difference Statistics:
m1emp_diff m2emp_diff m3emp_diff
count 21.000000 21.000000 21.000000
mean 576.666667 403.142857 227.857143
std 1001.245691 676.409734 520.148083
min -143.000000 -143.000000 -126.000000
25% 16.000000 16.000000 -10.000000
50% 98.000000 71.000000 23.000000
75% 995.000000 691.000000 199.000000
max 3828.000000 2147.000000 2125.000000
These results look decent. We can see that for the specific naics sector (22), the median difference between the synthetic true employment counts are for the most part under 100. Though there isn't much data to tell us about the bigger picture. To fix this lack of data, lets scale this problem up and look at the median differences of employment values over all 2-digit naics.
real_nj_data_2dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 2) &
(real_nj_data['own_code'] == 5)
]
synth_nj_df_2dig = synth_nj_df.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics_sector']
).agg({
'm1emp': 'sum',
'm2emp': 'sum',
'm3emp': 'sum'
}).reset_index()
Some industry codes that appear in the synthetic dataset but not the real. This is because some of the two digit codes appear in the real dataset as ranges of codes. For example NAICS 31-33 is Manufacturing.
We can also see that two codes appear in the real but not the synthetic. These codes are:
10: Total, all industries
99: Unclassified
Instead of attempting to reformat the NAICS code ranges in the real dataset, lets simply remove these.
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_2dig['naics_sector'] = synth_nj_df_2dig['naics_sector'].astype(str).str.strip().astype(int)
real_nj_data_2dig['industry_code'] = real_nj_data_2dig['industry_code'].astype(str).str.strip().astype(int)
# First get the intersection of common NAICS codes
common_codes = set(synth_nj_df_2dig['naics_sector']).intersection(set(real_nj_data_2dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_2dig = synth_nj_df_2dig[synth_nj_df_2dig['naics_sector'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_2dig = real_nj_data_2dig[real_nj_data_2dig['industry_code'].isin(common_codes)]
/tmp/ipykernel_3444/3341525310.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_2dig['industry_code'] = real_nj_data_2dig['industry_code'].astype(str).str.strip().astype(int)
Same as before we will take the differences in the emp values and remove any rows with zero as the real employment value.
real_nj_data_2dig['cnty'] = real_nj_data_2dig['cnty'].astype(int)
real_nj_data_2dig['state'] = real_nj_data_2dig['state'].astype(int)
real_nj_data_2dig['naics_sector'] = real_nj_data_2dig['industry_code']
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_2dig,
real_nj_data_2dig,
on=['year', 'qtr', 'state', 'cnty', 'naics_sector'],
suffixes=('_synth', '_real')
)
# Calculate differences
merged_df['m1emp_diff'] = merged_df['m1emp_synth'] - merged_df['m1emp_real']
merged_df['m2emp_diff'] = merged_df['m2emp_synth'] - merged_df['m2emp_real']
merged_df['m3emp_diff'] = merged_df['m3emp_synth'] - merged_df['m3emp_real']
# Select relevant columns for the result
result_df_2dig = merged_df[['year', 'qtr', 'state', 'cnty', 'naics_sector',
'm1emp_synth', 'm1emp_real', 'm1emp_diff',
'm2emp_synth', 'm2emp_real', 'm2emp_diff',
'm3emp_synth', 'm3emp_real', 'm3emp_diff']]
result_df_2dig = result_df_2dig[result_df_2dig['m1emp_real'] != 0]
Now we will take some summary statistics of the differences in employment values
def empresultout(result_df, naicslevel):
# Summary stats for differences
diff_stats = result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']].describe()
print("Difference Statistics:")
print(diff_stats)
# Correlation matrix
corr_matrix = result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']].corr()
print("\nCorrelation Matrix:")
print(corr_matrix)
# By NAICS sector
sector_stats = result_df.groupby(naicslevel)[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']].median()
print("\nMedian Differences by NAICS Sector:")
print(sector_stats)
##### Plots #####
## Box - no fliers
# Calculate Relative differences
result_df['m1emp_reldiff'] = (result_df['m1emp_synth'] - result_df['m1emp_real']) / result_df['m1emp_real']
result_df['m2emp_reldiff'] = (result_df['m2emp_synth'] - result_df['m2emp_real']) / result_df['m2emp_real']
result_df['m3emp_reldiff'] = (result_df['m3emp_synth'] - result_df['m3emp_real']) / result_df['m3emp_real']
plt.style.use('seaborn-v0_8') # Modern seaborn style
plt.figure(figsize=(15, 5)) # Adjusted figure size for single plot
fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(15,5))
sns.boxplot(data=result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']],
showfliers=False,
ax=ax1
)
ax1.set_title(f'Distribution of Employment Differences (Synthetic - Real) ({naicslevel})')
ax1.set_ylabel('Difference in Employment Counts')
ax1.set_xlabel('Employment Metric')
ax1.axhline(0, color='red', linestyle='--', alpha=0.7) # Add reference line at 0
ax1.set_xticks([0, 1, 2], ['M1EMP', 'M2EMP', 'M3EMP'])
sns.boxplot(data=result_df[['m1emp_reldiff', 'm2emp_reldiff', 'm3emp_reldiff']],
showfliers=False,
ax=ax2)
ax2.set_title(f'Relative Employment Differences\n(Synthetic - Real) / (Real) ({naicslevel})')
ax2.set_ylabel('Relative Difference')
ax2.set_xlabel('Employment Metric')
ax2.axhline(0, color='red', linestyle='--', alpha=0.7)
ax2.set_xticks([0, 1, 2], ['M1EMP', 'M2EMP', 'M3EMP'])
plt.tight_layout()
plt.show()
## Box with fliers
plt.style.use('seaborn-v0_8') # Modern seaborn style
plt.figure(figsize=(15, 5)) # Adjusted figure size for single plot
fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(15,5))
sns.boxplot(data=result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']],
showfliers=True,
ax=ax1
)
ax1.set_title(f'Distribution of Employment Differences (Synthetic - Real) ({naicslevel})')
ax1.set_ylabel('Difference in Employment Counts')
ax1.set_xlabel('Employment Metric')
ax1.axhline(0, color='red', linestyle='--', alpha=0.7) # Add reference line at 0
ax1.set_xticks([0, 1, 2], ['M1EMP', 'M2EMP', 'M3EMP'])
sns.boxplot(data=result_df[['m1emp_reldiff', 'm2emp_reldiff', 'm3emp_reldiff']],
showfliers=True,
ax=ax2)
ax2.set_title(f'Relative Employment Differences\n(Synthetic - Real) / (Real) ({naicslevel})')
ax2.set_ylabel('Relative Difference')
ax2.set_xlabel('Employment Metric')
ax2.axhline(0, color='red', linestyle='--', alpha=0.7)
ax2.set_xticks([0, 1, 2], ['M1EMP', 'M2EMP', 'M3EMP'])
plt.tight_layout()
plt.show()
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(15, 5))
metrics = ['m1emp', 'm2emp', 'm3emp']
for i, metric in enumerate(metrics):
plt.subplot(1, 3, i+1)
# Scatter plot
sns.scatterplot(
data=result_df,
x=f'{metric}_real',
y=f'{metric}_synth',
s=60,
alpha=0.7
)
# Fit regression with statsmodels (for CI)
X = sm.add_constant(result_df[f'{metric}_real']) # Adds intercept
model = sm.OLS(result_df[f'{metric}_synth'], X).fit()
# Generate predictions with CI
predictions = model.get_prediction(X)
pred_frame = predictions.summary_frame(alpha=0.05) # 95% CI
# Plot regression line
plt.plot(
result_df[f'{metric}_real'],
model.predict(X),
color='blue',
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
)
# Plot CI
plt.fill_between(
result_df[f'{metric}_real'],
pred_frame['obs_ci_lower'],
pred_frame['obs_ci_upper'],
color='blue',
alpha=0.2,
label='95% CI'
)
# Perfect agreement line
max_val = max(result_df[f'{metric}_real'].max(), result_df[f'{metric}_synth'].max())
plt.plot([0, max_val], [0, max_val], 'r--', label='Perfect agreement')
plt.title(f'Synthetic vs Real Employment ({metric.upper()}) ({naicslevel})')
plt.xlabel('Real Employment')
plt.ylabel('Synthetic Employment')
plt.legend()
plt.tight_layout()
plt.show()
def empresultout(result_df, naicslevel):
# Summary stats for differences
diff_stats = result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']].describe()
print("Difference Statistics:")
print(diff_stats)
# Correlation matrix
corr_matrix = result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']].corr()
print("\nCorrelation Matrix:")
print(corr_matrix)
# By NAICS sector
sector_stats = result_df.groupby(naicslevel)[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']].median()
print("\nMedian Differences by NAICS Sector:")
print(sector_stats)
##### Plots #####
## Boxplots (without fliers)
result_df['m1emp_reldiff'] = (result_df['m1emp_synth'] - result_df['m1emp_real']) / result_df['m1emp_real']
result_df['m2emp_reldiff'] = (result_df['m2emp_synth'] - result_df['m2emp_real']) / result_df['m2emp_real']
result_df['m3emp_reldiff'] = (result_df['m3emp_synth'] - result_df['m3emp_real']) / result_df['m3emp_real']
plt.style.use('seaborn-v0_8')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))
sns.boxplot(data=result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']], showfliers=False, ax=ax1)
ax1.set_title(f'Distribution of Employment Differences (Synthetic - Real) ({naicslevel})')
ax1.set_ylabel('Difference in Employment Counts')
ax1.set_xlabel('Employment Metric')
ax1.axhline(0, color='red', linestyle='--', alpha=0.7)
ax1.set_xticks([0,1,2], ['M1EMP', 'M2EMP', 'M3EMP'])
sns.boxplot(data=result_df[['m1emp_reldiff', 'm2emp_reldiff', 'm3emp_reldiff']], showfliers=False, ax=ax2)
ax2.set_title(f'Relative Employment Differences\n(Synthetic - Real) / (Real) ({naicslevel})')
ax2.set_ylabel('Relative Difference')
ax2.set_xlabel('Employment Metric')
ax2.axhline(0, color='red', linestyle='--', alpha=0.7)
ax2.set_xticks([0,1,2], ['M1EMP', 'M2EMP', 'M3EMP'])
plt.tight_layout()
plt.show()
## Boxplots (with fliers)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))
sns.boxplot(data=result_df[['m1emp_diff', 'm2emp_diff', 'm3emp_diff']], showfliers=True, ax=ax1)
ax1.set_title(f'Distribution of Employment Differences (Synthetic - Real) ({naicslevel})')
ax1.set_ylabel('Difference in Employment Counts')
ax1.set_xlabel('Employment Metric')
ax1.axhline(0, color='red', linestyle='--', alpha=0.7)
ax1.set_xticks([0,1,2], ['M1EMP', 'M2EMP', 'M3EMP'])
sns.boxplot(data=result_df[['m1emp_reldiff', 'm2emp_reldiff', 'm3emp_reldiff']], showfliers=True, ax=ax2)
ax2.set_title(f'Relative Employment Differences\n(Synthetic - Real) / (Real) ({naicslevel})')
ax2.set_ylabel('Relative Difference')
ax2.set_xlabel('Employment Metric')
ax2.axhline(0, color='red', linestyle='--', alpha=0.7)
ax2.set_xticks([0,1,2], ['M1EMP', 'M2EMP', 'M3EMP'])
plt.tight_layout()
plt.show()
## Scatter plots with regression
metrics = ['m1emp', 'm2emp', 'm3emp']
plt.figure(figsize=(15,5))
for i, metric in enumerate(metrics):
plt.subplot(1, 3, i+1)
sns.scatterplot(data=result_df, x=f'{metric}_real', y=f'{metric}_synth', s=60, alpha=0.7)
X = sm.add_constant(result_df[f'{metric}_real'])
model = sm.OLS(result_df[f'{metric}_synth'], X).fit()
predictions = model.get_prediction(X)
pred_frame = predictions.summary_frame(alpha=0.05)
plt.plot(result_df[f'{metric}_real'], model.predict(X), color='blue',
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
plt.fill_between(result_df[f'{metric}_real'], pred_frame['obs_ci_lower'], pred_frame['obs_ci_upper'],
color='blue', alpha=0.2, label='95% CI')
max_val = max(result_df[f'{metric}_real'].max(), result_df[f'{metric}_synth'].max())
plt.plot([0, max_val], [0, max_val], 'r--', label='Perfect agreement')
plt.title(f'Synthetic vs Real Employment ({metric.upper()}) ({naicslevel})')
plt.xlabel('Real Employment')
plt.ylabel('Synthetic Employment')
plt.legend()
plt.tight_layout()
plt.show()
## Employment distribution histograms (linear and log-scale)
plt.figure(figsize=(15, 4))
for i, metric in enumerate(metrics):
plt.subplot(1, 3, i+1)
sns.histplot(result_df[f'{metric}_real'], color='blue', kde=True, label='Real', alpha=0.6)
sns.histplot(result_df[f'{metric}_synth'], color='orange', kde=True, label='Synthetic', alpha=0.6)
plt.title(f'Distribution of Real vs Synthetic Employment ({metric.upper()}) ({naicslevel})')
plt.xlabel('Employment')
plt.ylabel('Count')
plt.legend()
plt.tight_layout()
plt.show()
# Log-scale version using log1p
plt.figure(figsize=(15, 4))
for i, metric in enumerate(metrics):
plt.subplot(1, 3, i+1)
sns.histplot(np.log1p(result_df[f'{metric}_real']), color='blue', kde=True, label='Real', alpha=0.6)
sns.histplot(np.log1p(result_df[f'{metric}_synth']), color='orange', kde=True, label='Synthetic', alpha=0.6)
plt.title(f'Log-Scale Distribution of Real vs Synthetic Employment ({metric.upper()}) ({naicslevel})')
plt.xlabel('log(1 + Employment)')
plt.ylabel('Count')
plt.legend()
plt.tight_layout()
plt.show()
empresultout(result_df_2dig, 'naics_sector')
Difference Statistics:
m1emp_diff m2emp_diff m3emp_diff
count 308.000000 308.000000 308.000000
mean 710.801948 565.561688 341.103896
std 1978.140020 1413.334007 1088.417584
min -3858.000000 -3766.000000 -3664.000000
25% -43.500000 -39.000000 -67.250000
50% 165.000000 196.000000 107.500000
75% 878.500000 762.250000 539.250000
max 16435.000000 8518.000000 6317.000000
Correlation Matrix:
m1emp_diff m2emp_diff m3emp_diff
m1emp_diff 1.000000 0.953335 0.650284
m2emp_diff 0.953335 1.000000 0.842215
m3emp_diff 0.650284 0.842215 1.000000
Median Differences by NAICS Sector:
m1emp_diff m2emp_diff m3emp_diff
naics_sector
11 -60.0 -78.5 -189.0
21 21.5 25.5 -3.0
22 51.0 45.0 17.5
23 342.0 525.0 351.0
42 1202.0 1026.5 873.5
51 -555.0 -568.0 -577.0
52 -8.0 11.0 -31.0
53 146.0 151.0 115.0
54 601.0 646.0 734.0
55 143.0 149.0 139.0
56 1481.0 1521.0 1167.0
61 275.0 214.0 246.0
62 1514.0 871.0 705.0
71 412.0 384.0 210.0
72 313.0 362.0 241.0
81 -173.0 -190.0 -178.0
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
real_nj_data_3dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 3) &
(real_nj_data['own_code'] == 5)
]
synth_nj_df_3dig = synth_nj_df.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics3']
).agg({
'm1emp': 'sum',
'm2emp': 'sum',
'm3emp': 'sum'
}).reset_index()
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_3dig['naics3'] = synth_nj_df_3dig['naics3'].astype(str).str.strip().astype(int)
real_nj_data_3dig['industry_code'] = real_nj_data_3dig['industry_code'].astype(str).str.strip().astype(int)
# First get the intersection of common NAICS codes
common_codes = set(synth_nj_df_3dig['naics3']).intersection(set(real_nj_data_3dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_3dig = synth_nj_df_3dig[synth_nj_df_3dig['naics3'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_3dig = real_nj_data_3dig[real_nj_data_3dig['industry_code'].isin(common_codes)]
real_nj_data_3dig['cnty'] = real_nj_data_3dig['cnty'].astype(int)
real_nj_data_3dig['state'] = real_nj_data_3dig['state'].astype(int)
real_nj_data_3dig['naics3'] = real_nj_data_3dig['industry_code']
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_3dig,
real_nj_data_3dig,
on=['year', 'qtr', 'state', 'cnty', 'naics3'],
suffixes=('_synth', '_real')
)
# Calculate differences
merged_df['m1emp_diff'] = merged_df['m1emp_synth'] - merged_df['m1emp_real']
merged_df['m2emp_diff'] = merged_df['m2emp_synth'] - merged_df['m2emp_real']
merged_df['m3emp_diff'] = merged_df['m3emp_synth'] - merged_df['m3emp_real']
# Select relevant columns for the result
result_df_3dig = merged_df[['year', 'qtr', 'state', 'cnty', 'naics3',
'm1emp_synth', 'm1emp_real', 'm1emp_diff',
'm2emp_synth', 'm2emp_real', 'm2emp_diff',
'm3emp_synth', 'm3emp_real', 'm3emp_diff']]
result_df_3dig = result_df_3dig[result_df_3dig['m1emp_real'] != 0]
/tmp/ipykernel_3444/1263198173.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_3dig['industry_code'] = real_nj_data_3dig['industry_code'].astype(str).str.strip().astype(int)
empresultout(result_df_3dig, 'naics3')
Difference Statistics:
m1emp_diff m2emp_diff m3emp_diff
count 1086.000000 1086.000000 1086.000000
mean 372.613260 269.972376 142.829650
std 1105.792613 731.462695 527.500573
min -1807.000000 -1819.000000 -1823.000000
25% -5.000000 -4.000000 -18.000000
50% 54.000000 52.500000 17.500000
75% 312.750000 273.500000 148.250000
max 13785.000000 7163.000000 6259.000000
Correlation Matrix:
m1emp_diff m2emp_diff m3emp_diff
m1emp_diff 1.000000 0.947981 0.538835
m2emp_diff 0.947981 1.000000 0.771684
m3emp_diff 0.538835 0.771684 1.000000
Median Differences by NAICS Sector:
m1emp_diff m2emp_diff m3emp_diff
naics3
114 35.0 20.5 -7.0
115 8.5 14.0 -6.0
212 -10.5 -6.5 -8.5
221 51.0 45.0 17.5
236 25.0 45.0 34.0
... ... ... ...
721 49.0 51.0 6.0
722 27.0 190.0 149.0
811 1.0 5.0 4.0
812 1.0 -2.0 7.0
813 60.0 88.0 105.0
[69 rows x 3 columns]
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
real_nj_data_4dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 4) &
(real_nj_data['own_code'] == 5)
]
synth_nj_df_4dig = synth_nj_df.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics4']
).agg({
'm1emp': 'sum',
'm2emp': 'sum',
'm3emp': 'sum'
}).reset_index()
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_4dig['naics4'] = synth_nj_df_4dig['naics4'].astype(str).str.strip().astype(int)
real_nj_data_4dig['industry_code'] = real_nj_data_4dig['industry_code'].astype(str).str.strip().astype(int)
# First get the intersection of common NAICS codes
common_codes = set(synth_nj_df_4dig['naics4']).intersection(set(real_nj_data_4dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_4dig = synth_nj_df_4dig[synth_nj_df_4dig['naics4'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_4dig = real_nj_data_4dig[real_nj_data_4dig['industry_code'].isin(common_codes)]
real_nj_data_4dig['cnty'] = real_nj_data_4dig['cnty'].astype(int)
real_nj_data_4dig['state'] = real_nj_data_4dig['state'].astype(int)
real_nj_data_4dig['naics4'] = real_nj_data_4dig['industry_code']
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_4dig,
real_nj_data_4dig,
on=['year', 'qtr', 'state', 'cnty', 'naics4'],
suffixes=('_synth', '_real')
)
# Calculate differences
merged_df['m1emp_diff'] = merged_df['m1emp_synth'] - merged_df['m1emp_real']
merged_df['m2emp_diff'] = merged_df['m2emp_synth'] - merged_df['m2emp_real']
merged_df['m3emp_diff'] = merged_df['m3emp_synth'] - merged_df['m3emp_real']
# Select relevant columns for the result
result_df_4dig = merged_df[['year', 'qtr', 'state', 'cnty', 'naics4',
'm1emp_synth', 'm1emp_real', 'm1emp_diff',
'm2emp_synth', 'm2emp_real', 'm2emp_diff',
'm3emp_synth', 'm3emp_real', 'm3emp_diff']]
result_df_4dig = result_df_4dig[result_df_4dig['m1emp_real'] != 0]
/tmp/ipykernel_3444/4140429712.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_4dig['industry_code'] = real_nj_data_4dig['industry_code'].astype(str).str.strip().astype(int)
empresultout(result_df_4dig, 'naics4')
Difference Statistics:
m1emp_diff m2emp_diff m3emp_diff
count 2786.000000 2786.000000 2786.000000
mean 54.229361 57.115937 51.221464
std 262.373830 252.534638 247.453295
min -2045.000000 -2121.000000 -2503.000000
25% -8.750000 -6.000000 -8.000000
50% 7.500000 10.000000 7.000000
75% 51.000000 55.750000 50.000000
max 5456.000000 5400.000000 5292.000000
Correlation Matrix:
m1emp_diff m2emp_diff m3emp_diff
m1emp_diff 1.000000 0.983852 0.961609
m2emp_diff 0.983852 1.000000 0.984044
m3emp_diff 0.961609 0.984044 1.000000
Median Differences by NAICS Sector:
m1emp_diff m2emp_diff m3emp_diff
naics4
1141 -8.0 -5.0 -10.0
1151 7.0 7.0 5.0
1152 -3.5 -5.0 -4.5
2123 -10.5 -6.5 -8.5
2211 8.5 18.0 17.5
... ... ... ...
8131 12.0 16.0 18.0
8132 4.5 9.0 4.5
8133 0.0 1.0 4.5
8134 16.0 14.5 13.0
8139 40.0 41.0 45.0
[223 rows x 3 columns]
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
real_nj_data_5dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 5) &
(real_nj_data['own_code'] == 5) &
(~real_nj_data['industry_code'].str.contains('-')) # Excludes any codes with hyphens
]
synth_nj_df_5dig = synth_nj_df.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics5']
).agg({
'm1emp': 'sum',
'm2emp': 'sum',
'm3emp': 'sum'
}).reset_index()
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_5dig['naics5'] = synth_nj_df_5dig['naics5'].astype(str).str.strip().astype(int)
real_nj_data_5dig['industry_code'] = real_nj_data_5dig['industry_code'].astype(str).str.strip().astype(int)
# First get the intersection of common NAICS codes
common_codes = set(synth_nj_df_5dig['naics5']).intersection(set(real_nj_data_5dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_5dig = synth_nj_df_5dig[synth_nj_df_5dig['naics5'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_5dig = real_nj_data_5dig[real_nj_data_5dig['industry_code'].isin(common_codes)]
real_nj_data_5dig['cnty'] = real_nj_data_5dig['cnty'].astype(int)
real_nj_data_5dig['state'] = real_nj_data_5dig['state'].astype(int)
real_nj_data_5dig['naics5'] = real_nj_data_5dig['industry_code']
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_5dig,
real_nj_data_5dig,
on=['year', 'qtr', 'state', 'cnty', 'naics5'],
suffixes=('_synth', '_real')
)
# Calculate differences
merged_df['m1emp_diff'] = merged_df['m1emp_synth'] - merged_df['m1emp_real']
merged_df['m2emp_diff'] = merged_df['m2emp_synth'] - merged_df['m2emp_real']
merged_df['m3emp_diff'] = merged_df['m3emp_synth'] - merged_df['m3emp_real']
# Select relevant columns for the result
result_df_5dig = merged_df[['year', 'qtr', 'state', 'cnty', 'naics5',
'm1emp_synth', 'm1emp_real', 'm1emp_diff',
'm2emp_synth', 'm2emp_real', 'm2emp_diff',
'm3emp_synth', 'm3emp_real', 'm3emp_diff']]
result_df_5dig = result_df_5dig[result_df_5dig['m1emp_real'] != 0]
/tmp/ipykernel_3444/1684374336.py:15: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_5dig['industry_code'] = real_nj_data_5dig['industry_code'].astype(str).str.strip().astype(int)
empresultout(result_df_5dig, 'naics5')
Difference Statistics:
m1emp_diff m2emp_diff m3emp_diff
count 4624.000000 4624.000000 4624.000000
mean 32.813798 34.428633 30.943555
std 256.709583 250.284880 247.564529
min -2206.000000 -2399.000000 -2503.000000
25% -25.000000 -24.000000 -25.000000
50% 4.000000 5.000000 4.000000
75% 50.000000 52.250000 50.000000
max 6855.000000 6172.000000 5883.000000
Correlation Matrix:
m1emp_diff m2emp_diff m3emp_diff
m1emp_diff 1.000000 0.990653 0.974343
m2emp_diff 0.990653 1.000000 0.988957
m3emp_diff 0.974343 0.988957 1.000000
Median Differences by NAICS Sector:
m1emp_diff m2emp_diff m3emp_diff
naics5
11411 -8.0 -5.0 -10.0
11511 7.0 7.0 5.0
11521 -3.5 -5.0 -4.5
21232 7.0 9.5 5.0
22111 -135.0 -137.0 -133.0
... ... ... ...
81391 -31.0 -33.0 -35.0
81392 -18.0 -17.0 -30.0
81393 45.0 45.5 34.5
81394 -10.0 -8.0 -5.0
81399 36.0 42.0 42.0
[463 rows x 3 columns]
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
real_nj_data_6dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 6) &
(real_nj_data['own_code'] == 5)
]
synth_nj_df_6dig = synth_nj_df.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics']
).agg({
'm1emp': 'sum',
'm2emp': 'sum',
'm3emp': 'sum'
}).reset_index()
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_6dig['naics'] = synth_nj_df_6dig['naics'].astype(str).str.strip().astype(int)
real_nj_data_6dig['industry_code'] = real_nj_data_6dig['industry_code'].astype(str).str.strip().astype(int)
# First get the intersection of common NAICS codes
common_codes = set(synth_nj_df_6dig['naics']).intersection(set(real_nj_data_6dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_6dig = synth_nj_df_6dig[synth_nj_df_6dig['naics'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_6dig = real_nj_data_6dig[real_nj_data_6dig['industry_code'].isin(common_codes)]
real_nj_data_6dig['cnty'] = real_nj_data_6dig['cnty'].astype(int)
real_nj_data_6dig['state'] = real_nj_data_6dig['state'].astype(int)
real_nj_data_6dig['naics'] = real_nj_data_6dig['industry_code']
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_6dig,
real_nj_data_6dig,
on=['year', 'qtr', 'state', 'cnty', 'naics'],
suffixes=('_synth', '_real')
)
# Calculate differences
merged_df['m1emp_diff'] = merged_df['m1emp_synth'] - merged_df['m1emp_real']
merged_df['m2emp_diff'] = merged_df['m2emp_synth'] - merged_df['m2emp_real']
merged_df['m3emp_diff'] = merged_df['m3emp_synth'] - merged_df['m3emp_real']
# Select relevant columns for the result
result_df_6dig = merged_df[['year', 'qtr', 'state', 'cnty', 'naics',
'm1emp_synth', 'm1emp_real', 'm1emp_diff',
'm2emp_synth', 'm2emp_real', 'm2emp_diff',
'm3emp_synth', 'm3emp_real', 'm3emp_diff']]
result_df_6dig = result_df_6dig[result_df_6dig['m1emp_real'] != 0]
/tmp/ipykernel_3444/1127110994.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_6dig['industry_code'] = real_nj_data_6dig['industry_code'].astype(str).str.strip().astype(int)
empresultout(result_df_6dig, 'naics')
Difference Statistics:
m1emp_diff m2emp_diff m3emp_diff
count 4772.000000 4772.000000 4772.000000
mean 28.518022 29.565172 26.686295
std 306.272291 301.360541 299.624634
min -5557.000000 -5631.000000 -5997.000000
25% -29.000000 -28.000000 -30.000000
50% 3.000000 3.000000 2.000000
75% 50.000000 51.000000 50.000000
max 6855.000000 6172.000000 6013.000000
Correlation Matrix:
m1emp_diff m2emp_diff m3emp_diff
m1emp_diff 1.000000 0.994068 0.983477
m2emp_diff 0.994068 1.000000 0.993139
m3emp_diff 0.983477 0.993139 1.000000
Median Differences by NAICS Sector:
m1emp_diff m2emp_diff m3emp_diff
naics
114111 -11.0 6.0 3.0
115115 7.0 7.0 5.0
115210 -3.5 -5.0 -4.5
221112 -70.0 -74.0 -74.0
221122 51.0 52.0 51.0
... ... ... ...
813910 -31.0 -33.0 -35.0
813920 -18.0 -17.0 -30.0
813930 45.0 45.5 34.5
813940 -10.0 -8.0 -5.0
813990 36.0 42.0 42.0
[549 rows x 3 columns]
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
/tmp/ipykernel_3444/563773864.py:77: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}')
We will do a similar analysis as before; however, we will now focus on wages rather than employment counts
Same as before, lets start at the 2-digit naics level.
I leave two result functions in allowing you to see the scatter plots on a log scale or untransformed.
real_nj_data_2dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 2) &
(real_nj_data['own_code'] == 5)
]
synth_nj_df_temp = synth_nj_df.copy()
synth_nj_df_temp['wage'] = synth_nj_df['wage'] * 1000
synth_nj_df_2dig = synth_nj_df_temp.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics_sector']
).agg({
'wage': 'sum'
}).reset_index()
# # Convert to string first to handle any non-numeric values, then to int
synth_nj_df_2dig['naics_sector'] = synth_nj_df_2dig['naics_sector'].astype(str).str.strip().astype(int)
real_nj_data_2dig['industry_code'] = real_nj_data_2dig['industry_code'].astype(str).str.strip().astype(int)
# # First get the intersection of common NAICS codes
common_codes = set(synth_nj_df_2dig['naics_sector']).intersection(set(real_nj_data_2dig['industry_code']))
# # Filter synthetic data to only keep rows with common codes
synth_nj_df_2dig = synth_nj_df_2dig[synth_nj_df_2dig['naics_sector'].isin(common_codes)]
# # Filter real data to only keep rows with common codes
real_nj_data_2dig = real_nj_data_2dig[real_nj_data_2dig['industry_code'].isin(common_codes)]
real_nj_data_2dig['cnty'] = real_nj_data_2dig['cnty'].astype(int)
real_nj_data_2dig['state'] = real_nj_data_2dig['state'].astype(int)
real_nj_data_2dig['naics_sector'] = real_nj_data_2dig['industry_code']
# # Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_2dig,
real_nj_data_2dig,
on=['year', 'qtr', 'state', 'cnty', 'naics_sector']
)
# # # Calculate differences
merged_df['wage_diff'] = merged_df['wage'] - merged_df['total_qtrly_wages']
merged_df['rel_wage_diff'] = (merged_df['wage'] - merged_df['total_qtrly_wages']) / merged_df['total_qtrly_wages']
# # Select relevant columns for the result
result_df_2dig = merged_df[[
'year', 'qtr', 'state', 'cnty', 'naics_sector',
'wage', 'total_qtrly_wages','wage_diff', 'rel_wage_diff'
]]
result_df_2dig = result_df_2dig[result_df_2dig['total_qtrly_wages'] != 0]
/tmp/ipykernel_3444/920177403.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_2dig['industry_code'] = real_nj_data_2dig['industry_code'].astype(str).str.strip().astype(int)
def wageresultout(result_df, naicslevel):
# Filter out zeros and negative values before log transform
plot_df = result_df[(result_df['total_qtrly_wages'] > 0) &
(result_df['wage'] > 0)].copy()
# Summary stats for differences
diff_stats = result_df[['wage_diff', 'rel_wage_diff']].describe()
print("Difference Statistics:")
print(diff_stats)
# By NAICS sector
sector_stats = result_df.groupby(naicslevel)[['wage_diff', 'rel_wage_diff']].median()
print("\nMedian Differences by NAICS Sector:")
print(sector_stats)
##### Plots #####
## Box - no fliers
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(5, 5))
sns.boxplot(data=result_df['wage_diff'], showfliers=False)
plt.title(f'Distribution of Wage Differences (Synthetic - Real) ({naicslevel})')
plt.ylabel('Difference in Wage Counts')
plt.xlabel('Wage Difference')
plt.axhline(0, color='red', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
## Relative differences - no fliers
plt.figure(figsize=(5, 5))
sns.boxplot(data=result_df['rel_wage_diff'], showfliers=False)
plt.title(f'Relative Wage Differences (Synthetic - Real)/Real ({naicslevel})')
plt.ylabel('Relative Difference')
plt.xlabel('Relative Wage Difference')
plt.axhline(0, color='red', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
## Box with fliers
plt.figure(figsize=(5, 5))
sns.boxplot(data=result_df['wage_diff'], showfliers=True)
plt.title(f'Distribution of Wage Differences (Synthetic - Real) ({naicslevel})')
plt.ylabel('Difference in Wage Counts')
plt.xlabel('Wage Difference')
plt.axhline(0, color='red', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
## Relative differences with fliers
plt.figure(figsize=(5, 5))
sns.boxplot(data=result_df['rel_wage_diff'], showfliers=True)
plt.title(f'Relative Wage Differences (Synthetic - Real)/Real ({naicslevel})')
plt.ylabel('Relative Difference')
plt.xlabel('Relative Wage Difference')
plt.axhline(0, color='red', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
##### Log-transformed Scatter Plot #####
plt.style.use('seaborn-v0_8')
plt.figure(figsize=(5, 5))
# Create log-transformed columns
plot_df['log_real'] = np.log10(plot_df['total_qtrly_wages'])
plot_df['log_synth'] = np.log10(plot_df['wage'])
plt.figure(figsize=(5, 5))
# Scatter plot
sns.scatterplot(
data=result_df,
x='total_qtrly_wages',
y='wage',
s=60,
alpha=0.7
)
# Fit regression with statsmodels (for CI)
X = sm.add_constant(result_df['total_qtrly_wages']) # Adds intercept
model = sm.OLS(result_df['wage'], X).fit()
# Generate predictions with CI
predictions = model.get_prediction(X)
pred_frame = predictions.summary_frame(alpha=0.05) # 95% CI
# Plot regression line
plt.plot(
result_df['total_qtrly_wages'],
model.predict(X),
color='blue',
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
)
# Plot CI
plt.fill_between(
result_df['total_qtrly_wages'],
pred_frame['obs_ci_lower'],
pred_frame['obs_ci_upper'],
color='blue',
alpha=0.2,
label='95% CI'
)
# Perfect agreement line
max_val = max(result_df['total_qtrly_wages'].max(), result_df['wage'].max())
plt.plot([0, max_val], [0, max_val], 'r--', label='Perfect agreement')
plt.title(f'Synthetic vs Real Wages ({naicslevel})')
plt.xlabel('Real Wages')
plt.ylabel('Synthetic Wages')
plt.legend()
plt.tight_layout()
plt.show()
# Scatter plot with log scale
sns.scatterplot(
data=plot_df,
x='log_real',
y='log_synth',
s=60,
alpha=0.7
)
# Fit regression on log-transformed data
X = sm.add_constant(plot_df['log_real'])
model = sm.OLS(plot_df['log_synth'], X).fit()
# Generate predictions with CI
predictions = model.get_prediction(X)
pred_frame = predictions.summary_frame(alpha=0.05)
# Plot regression line
plt.plot(
plot_df['log_real'],
model.predict(X),
color='blue',
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
)
# Plot CI
plt.fill_between(
plot_df['log_real'],
pred_frame['obs_ci_lower'],
pred_frame['obs_ci_upper'],
color='blue',
alpha=0.2,
label='95% CI'
)
# Perfect agreement line
max_log = max(plot_df['log_real'].max(), plot_df['log_synth'].max())
min_log = min(plot_df['log_real'].min(), plot_df['log_synth'].min())
plt.plot([min_log, max_log], [min_log, max_log], 'r--', label='Perfect agreement')
plt.title(f'Log10 Synthetic vs Real Wages ({naicslevel})')
plt.xlabel('Log10(Real Wages)')
plt.ylabel('Log10(Synthetic Wages)')
plt.legend()
plt.tight_layout()
plt.show()
# Wage distribution histograms
plt.figure(figsize=(6, 4))
sns.histplot(result_df['total_qtrly_wages'], color='blue', kde=True, label='Real Wages', alpha=0.6)
sns.histplot(result_df['wage'], color='orange', kde=True, label='Synthetic Wages', alpha=0.6)
plt.title(f'Distribution of Real vs Synthetic Wages ({naicslevel})')
plt.xlabel('Wages')
plt.ylabel('Count')
plt.legend()
plt.tight_layout()
plt.show()
# Log-scale version
plt.figure(figsize=(6, 4))
sns.histplot(np.log1p(result_df['total_qtrly_wages']), color='blue', kde=True, label='Real Wages', alpha=0.6)
sns.histplot(np.log1p(result_df['wage']), color='orange', kde=True, label='Synthetic Wages', alpha=0.6)
plt.title(f'Log-Scale Distribution of Real vs Synthetic Wages ({naicslevel})')
plt.xlabel('log(1 + Wages)')
plt.ylabel('Count')
plt.legend()
plt.tight_layout()
plt.show()
wageresultout(result_df_2dig, 'naics_sector')
Difference Statistics:
wage_diff rel_wage_diff
count 3.080000e+02 308.000000
mean 7.650135e+07 4.496001
std 5.043050e+08 37.237330
min -3.878087e+08 -0.994832
25% -8.125792e+05 -0.042533
50% 4.463297e+06 0.163608
75% 3.191077e+07 0.666251
max 8.002062e+09 469.994905
Median Differences by NAICS Sector:
wage_diff rel_wage_diff
naics_sector
11 -1117239.0 -0.762632
21 14587.5 0.051459
22 43010681.0 2.082511
23 3000981.0 0.069810
42 95837295.5 1.016216
51 -2729935.0 -0.073603
52 8439122.0 0.081891
53 2427169.0 0.176392
54 351803.0 0.030849
55 7337134.0 0.112164
56 -264326.0 -0.004947
61 12204677.0 0.780503
62 99624319.0 0.406664
71 1266872.0 0.177912
72 942129.0 0.019837
81 5417103.0 0.184091
/tmp/ipykernel_3444/3840527632.py:91: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
<Figure size 500x500 with 0 Axes>
/tmp/ipykernel_3444/3840527632.py:138: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
real_nj_data_3dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 3) &
(real_nj_data['own_code'] == 5)
]
synth_nj_df_temp = synth_nj_df.copy()
synth_nj_df_temp['wage'] = synth_nj_df['wage'] * 1000
synth_nj_df_3dig = synth_nj_df_temp.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics3']
).agg({
'wage': 'sum'
}).reset_index()
# # Convert to string first to handle any non-numeric values, then to int
synth_nj_df_3dig['naics3'] = synth_nj_df_3dig['naics3'].astype(str).str.strip().astype(int)
real_nj_data_3dig['industry_code'] = real_nj_data_3dig['industry_code'].astype(str).str.strip().astype(int)
# # First get the intersection of common NAICS codes
common_codes = set(synth_nj_df_3dig['naics3']).intersection(set(real_nj_data_3dig['industry_code']))
# # Filter synthetic data to only keep rows with common codes
synth_nj_df_3dig = synth_nj_df_3dig[synth_nj_df_3dig['naics3'].isin(common_codes)]
# # Filter real data to only keep rows with common codes
real_nj_data_3dig = real_nj_data_3dig[real_nj_data_3dig['industry_code'].isin(common_codes)]
real_nj_data_3dig['cnty'] = real_nj_data_3dig['cnty'].astype(int)
real_nj_data_3dig['state'] = real_nj_data_3dig['state'].astype(int)
real_nj_data_3dig['naics3'] = real_nj_data_3dig['industry_code']
# # Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_3dig,
real_nj_data_3dig,
on=['year', 'qtr', 'state', 'cnty', 'naics3']
)
# # # Calculate differences
merged_df['wage_diff'] = merged_df['wage'] - merged_df['total_qtrly_wages']
merged_df['rel_wage_diff'] = (merged_df['wage'] - merged_df['total_qtrly_wages']) / merged_df['total_qtrly_wages']
# # Select relevant columns for the result
result_df_3dig = merged_df[[
'year', 'qtr', 'state', 'cnty', 'naics3',
'wage', 'total_qtrly_wages','wage_diff', 'rel_wage_diff'
]]
result_df_3dig = result_df_3dig[result_df_3dig['total_qtrly_wages'] != 0]
/tmp/ipykernel_3444/613502961.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_3dig['industry_code'] = real_nj_data_3dig['industry_code'].astype(str).str.strip().astype(int)
wageresultout(result_df_3dig, 'naics3')
Difference Statistics:
wage_diff rel_wage_diff
count 1.086000e+03 1086.000000
mean 4.562397e+07 10.513295
std 3.778882e+08 93.274332
min -3.878087e+08 -0.990490
25% -1.067710e+06 -0.106062
50% 7.519410e+05 0.131011
75% 9.312716e+06 0.971821
max 8.002062e+09 1602.179743
Median Differences by NAICS Sector:
wage_diff rel_wage_diff
naics3
114 280202.0 0.592526
115 -56196.0 -0.107592
212 -42772.5 -0.074377
221 43010681.0 2.082511
236 -160521.0 -0.044275
... ... ...
721 -34972.0 -0.005334
722 -795397.0 -0.017715
811 99249.0 0.003272
812 -335726.0 -0.060504
813 6840073.0 0.788125
[69 rows x 2 columns]
/tmp/ipykernel_3444/3840527632.py:91: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
<Figure size 500x500 with 0 Axes>
/tmp/ipykernel_3444/3840527632.py:138: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
# Filter for 4-digit NAICS and ownership code 5
real_nj_data_4dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 4) &
(real_nj_data['own_code'] == 5)
]
# Prepare synthetic data - convert wage to same units and group by 4-digit NAICS
synth_nj_df_temp = synth_nj_df.copy()
synth_nj_df_temp['wage'] = synth_nj_df['wage'] * 1000
synth_nj_df_4dig = synth_nj_df_temp.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics4']
).agg({
'wage': 'sum'
}).reset_index()
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_4dig['naics4'] = synth_nj_df_4dig['naics4'].astype(str).str.strip().astype(int)
real_nj_data_4dig['industry_code'] = real_nj_data_4dig['industry_code'].astype(str).str.strip().astype(int)
# Get the intersection of common NAICS codes
common_codes = set(synth_nj_df_4dig['naics4']).intersection(set(real_nj_data_4dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_4dig = synth_nj_df_4dig[synth_nj_df_4dig['naics4'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_4dig = real_nj_data_4dig[real_nj_data_4dig['industry_code'].isin(common_codes)]
# Standardize column names and types
real_nj_data_4dig['cnty'] = real_nj_data_4dig['cnty'].astype(int)
real_nj_data_4dig['state'] = real_nj_data_4dig['state'].astype(int)
real_nj_data_4dig['naics4'] = real_nj_data_4dig['industry_code']
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_4dig,
real_nj_data_4dig,
on=['year', 'qtr', 'state', 'cnty', 'naics4']
)
# Calculate differences
merged_df['wage_diff'] = merged_df['wage'] - merged_df['total_qtrly_wages']
merged_df['rel_wage_diff'] = (merged_df['wage'] - merged_df['total_qtrly_wages']) / merged_df['total_qtrly_wages']
# Select relevant columns for the result
result_df_4dig = merged_df[[
'year', 'qtr', 'state', 'cnty', 'naics4',
'wage', 'total_qtrly_wages', 'wage_diff', 'rel_wage_diff'
]]
# Remove rows where real wages are zero to avoid division issues
result_df_4dig = result_df_4dig[result_df_4dig['total_qtrly_wages'] != 0]
/tmp/ipykernel_3444/4013910675.py:18: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_4dig['industry_code'] = real_nj_data_4dig['industry_code'].astype(str).str.strip().astype(int)
wageresultout(result_df_4dig, 'naics4')
Difference Statistics:
wage_diff rel_wage_diff
count 2.786000e+03 2786.000000
mean 6.652201e+06 6.850864
std 1.098474e+08 131.850736
min -5.212664e+08 -0.990168
25% -1.040338e+06 -0.257228
50% -1.396650e+04 -0.010471
75% 1.260549e+06 0.372584
max 2.980660e+09 6478.504822
Median Differences by NAICS Sector:
wage_diff rel_wage_diff
naics4
1141 151963.0 0.133884
1151 24418.0 0.616897
1152 187704.5 3.618429
2123 -42772.5 -0.074377
2211 14032551.0 0.761735
... ... ...
8131 9512125.0 8.112716
8132 43671.0 0.111361
8133 -594462.5 -0.350595
8134 -802385.5 -0.418010
8139 37269.0 0.046664
[223 rows x 2 columns]
/tmp/ipykernel_3444/3840527632.py:91: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
<Figure size 500x500 with 0 Axes>
/tmp/ipykernel_3444/3840527632.py:138: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
# Filter for 5-digit NAICS and ownership code 5
real_nj_data_5dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 5) &
(real_nj_data['own_code'] == 5) &
(~real_nj_data['industry_code'].str.contains('-'))
]
# Prepare synthetic data - convert wage to same units and group by 5-digit NAICS
synth_nj_df_temp = synth_nj_df.copy()
synth_nj_df_temp['wage'] = synth_nj_df['wage'] * 1000
synth_nj_df_5dig = synth_nj_df_temp.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics5']
).agg({
'wage': 'sum'
}).reset_index()
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_5dig['naics5'] = synth_nj_df_5dig['naics5'].astype(str).str.strip().astype(int)
real_nj_data_5dig['industry_code'] = real_nj_data_5dig['industry_code'].astype(str).str.strip().astype(int)
# Get the intersection of common NAICS codes
common_codes = set(synth_nj_df_5dig['naics5']).intersection(set(real_nj_data_5dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_5dig = synth_nj_df_5dig[synth_nj_df_5dig['naics5'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_5dig = real_nj_data_5dig[real_nj_data_5dig['industry_code'].isin(common_codes)]
# Standardize column names and types
real_nj_data_5dig['cnty'] = real_nj_data_5dig['cnty'].astype(int)
real_nj_data_5dig['state'] = real_nj_data_5dig['state'].astype(int)
real_nj_data_5dig['naics5'] = real_nj_data_5dig['industry_code']
# Merge the dataframes on county
merged_df = pd.merge(
synth_nj_df_5dig,
real_nj_data_5dig,
on=['year', 'qtr', 'state', 'cnty', 'naics5']
)
# Calculate differences
merged_df['wage_diff'] = merged_df['wage'] - merged_df['total_qtrly_wages']
merged_df['rel_wage_diff'] = (merged_df['wage'] - merged_df['total_qtrly_wages']) / merged_df['total_qtrly_wages']
# Select relevant columns for the result
result_df_5dig = merged_df[[
'year', 'qtr', 'state', 'cnty', 'naics5',
'wage', 'total_qtrly_wages', 'wage_diff', 'rel_wage_diff'
]]
# Remove rows where real wages are zero to avoid division issues
result_df_5dig = result_df_5dig[result_df_5dig['total_qtrly_wages'] != 0]
/tmp/ipykernel_3444/783126966.py:19: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_5dig['industry_code'] = real_nj_data_5dig['industry_code'].astype(str).str.strip().astype(int)
wageresultout(result_df_5dig, 'naics5')
Difference Statistics:
wage_diff rel_wage_diff
count 4.624000e+03 4624.000000
mean 1.519388e+06 3.281722
std 4.410691e+07 99.628437
min -5.097223e+08 -0.998158
25% -6.999670e+05 -0.342593
50% -2.755950e+04 -0.034353
75% 5.719885e+05 0.397085
max 2.045610e+09 6478.504822
Median Differences by NAICS Sector:
wage_diff rel_wage_diff
naics5
11411 151963.0 0.133884
11511 24418.0 0.616897
11521 187704.5 3.618429
21232 -131364.0 -0.127134
22111 -3757212.0 -0.465565
... ... ...
81391 -402675.0 -0.381606
81392 -139861.0 -0.287271
81393 -32742.5 -0.043289
81394 -8910.5 0.014091
81399 460599.0 0.390070
[463 rows x 2 columns]
/tmp/ipykernel_3444/3840527632.py:91: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
<Figure size 500x500 with 0 Axes>
/tmp/ipykernel_3444/3840527632.py:138: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
# Filter for 6-digit NAICS and ownership code 5
real_nj_data_6dig = real_nj_data[
(real_nj_data['industry_code'].str.len() == 6) &
(real_nj_data['own_code'] == 5)
]
# Prepare synthetic data - convert wage to same units and group by 6-digit NAICS
synth_nj_df_temp = synth_nj_df.copy()
synth_nj_df_temp['wage'] = synth_nj_df['wage'] * 1000
synth_nj_df_6dig = synth_nj_df_temp.groupby(
['year', 'qtr', 'state', 'cnty', 'own', 'naics']
).agg({
'wage': 'sum'
}).reset_index()
# Convert to string first to handle any non-numeric values, then to int
synth_nj_df_6dig['naics'] = synth_nj_df_6dig['naics'].astype(str).str.strip().astype(int)
real_nj_data_6dig['industry_code'] = real_nj_data_6dig['industry_code'].astype(str).str.strip().astype(int)
# Get the intersection of common NAICS codes
common_codes = set(synth_nj_df_6dig['naics']).intersection(set(real_nj_data_6dig['industry_code']))
# Filter synthetic data to only keep rows with common codes
synth_nj_df_6dig = synth_nj_df_6dig[synth_nj_df_6dig['naics'].isin(common_codes)]
# Filter real data to only keep rows with common codes
real_nj_data_6dig = real_nj_data_6dig[real_nj_data_6dig['industry_code'].isin(common_codes)]
# Standardize column names and types
real_nj_data_6dig['cnty'] = real_nj_data_6dig['cnty'].astype(int)
real_nj_data_6dig['state'] = real_nj_data_6dig['state'].astype(int)
real_nj_data_6dig['naics'] = real_nj_data_6dig['industry_code']
# Merge the dataframes
merged_df = pd.merge(
synth_nj_df_6dig,
real_nj_data_6dig,
on=['year', 'qtr', 'state', 'cnty', 'naics']
)
# Calculate differences
merged_df['wage_diff'] = merged_df['wage'] - merged_df['total_qtrly_wages']
merged_df['rel_wage_diff'] = (merged_df['wage'] - merged_df['total_qtrly_wages']) / merged_df['total_qtrly_wages']
# Select relevant columns for the result
result_df_6dig = merged_df[[
'year', 'qtr', 'state', 'cnty', 'naics',
'wage', 'total_qtrly_wages', 'wage_diff', 'rel_wage_diff'
]]
# Remove rows where real wages are zero to avoid division issues
result_df_6dig = result_df_6dig[result_df_6dig['total_qtrly_wages'] != 0]
/tmp/ipykernel_3444/2025247857.py:18: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy real_nj_data_6dig['industry_code'] = real_nj_data_6dig['industry_code'].astype(str).str.strip().astype(int)
wageresultout(result_df_6dig, 'naics')
Difference Statistics:
wage_diff rel_wage_diff
count 4.772000e+03 4772.000000
mean 9.824924e+05 2.667257
std 3.044333e+07 96.068274
min -5.097223e+08 -0.998158
25% -6.122455e+05 -0.385826
50% -3.465650e+04 -0.047152
75% 4.494778e+05 0.364487
max 1.125154e+09 6478.504822
Median Differences by NAICS Sector:
wage_diff rel_wage_diff
naics
114111 -3177.0 -0.015637
115115 24418.0 0.616897
115210 187704.5 3.618429
221112 -1550125.0 -0.370212
221122 788277.0 0.272127
... ... ...
813910 -402675.0 -0.381606
813920 -139861.0 -0.287271
813930 -32742.5 -0.043289
813940 -8910.5 0.014091
813990 460599.0 0.390070
[549 rows x 2 columns]
/tmp/ipykernel_3444/3840527632.py:91: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'
<Figure size 500x500 with 0 Axes>
/tmp/ipykernel_3444/3840527632.py:138: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
label=f'y = {model.params[1]:.2f}x + {model.params[0]:.2f}\nR² = {model.rsquared:.2f}'